SNOW-2105991: Use pre-computed row counts more aggressively #3358

Merged: sfc-gh-joshi merged 6 commits into main from joshi/aggressive-row-counts on May 21, 2025

@sfc-gh-joshi sfc-gh-joshi commented May 13, 2025

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-2105991

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
    • If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
  3. Please describe how your code solves the related issue.

SNOW-1900040 (#3144) added row count estimation that propagates across pandas dataframe operations, including storing precise row count information for frames constructed from native pandas/python objects and frames created by read_snowflake. This PR extends that earlier work by propagating the precise row count value across certain ordered dataframe operations, namely select, union_all, and sort, which all have predictable effects on the size of the resulting frame. This PR also changes internal methods that retrieve the frame's row count (used in many operations for bounds checking or other validation) to use this cached row count value rather than issuing a query.
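
The propagation rule described above can be sketched roughly as follows. This is a toy model only: `FrameSketch` and its methods are hypothetical names invented for illustration and are not the actual ordered dataframe classes in Snowpark pandas. The `filter` method is included purely as a contrasting assumption, showing an operation whose output size cannot be predicted without running a query.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FrameSketch:
    """Hypothetical stand-in for an ordered dataframe; tracks only the cached row count."""

    # None means the row count is unknown and a query would be needed to get it.
    row_count: Optional[int] = None

    def select(self) -> "FrameSketch":
        # Projecting columns is assumed to leave the number of rows unchanged.
        return FrameSketch(self.row_count)

    def sort(self) -> "FrameSketch":
        # Reordering rows does not add or remove any.
        return FrameSketch(self.row_count)

    def union_all(self, other: "FrameSketch") -> "FrameSketch":
        # The result size is the sum of both inputs, but only if both are known.
        if self.row_count is None or other.row_count is None:
            return FrameSketch(None)
        return FrameSketch(self.row_count + other.row_count)

    def filter(self) -> "FrameSketch":
        # Assumption for contrast: the result size depends on the data,
        # so the cached count is dropped.
        return FrameSketch(None)
```

In this sketch, `FrameSketch(3).union_all(FrameSketch(4)).sort()` still carries a known row count of 7, while any chain that passes through `filter` falls back to unknown.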

In short, retrieving the length of a dataframe created from native pandas/python or directly by read_snowflake will no longer issue an extra query. This reduces query counts across large parts of the test suite. Some notable affected APIs include repr, crosstab, insert, loc, iloc, iterrows, and groupby operations with by specified as a native list object.
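
As a rough illustration of why this lowers query counts, the sketch below models a frame whose length lookup prefers a cached value and only falls back to a (simulated) count query when no value is cached. `CountingFrame`, `num_rows`, and `queries_issued` are hypothetical names for this example and do not correspond to real Snowpark pandas internals.

```python
from typing import Iterable, Optional


class CountingFrame:
    """Hypothetical frame that records how many count queries it issues."""

    def __init__(self, rows: Iterable, cached_row_count: Optional[int] = None):
        self._rows = list(rows)
        self._cached_row_count = cached_row_count
        self.queries_issued = 0

    def _run_count_query(self) -> int:
        # Stand-in for a round trip to the warehouse (e.g. a COUNT(*) query).
        self.queries_issued += 1
        return len(self._rows)

    def num_rows(self) -> int:
        # Prefer the pre-computed value; only query when nothing is cached.
        if self._cached_row_count is not None:
            return self._cached_row_count
        return self._run_count_query()


# A frame built from native data knows its size up front, so no query is needed.
native = CountingFrame([1, 2, 3], cached_row_count=3)
native_len = native.num_rows()

# A frame with no cached count has to issue a query to learn its size.
unknown = CountingFrame([1, 2, 3])
unknown_len = unknown.num_rows()
```

Operations that validate bounds against the frame length, such as the indexing APIs listed above, would each save one such round trip under this model.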

I did not benchmark most of these operations, since removing a query should be a strict improvement. I did benchmark changes to the repr operation, as some more work was needed to take advantage of the cached row count for that API (following the approach taken in SNOW-1705797/#2760). At some point between 4/15 and 4/22, repr for very large dataframes began taking almost twice as long as before (dashboard metrics link; the daily benchmark runner experienced some downtime during this period due to modin versioning issues, so the exact date is lost). I did not investigate the root cause, but this work remedies some of the performance impact.

Performance for repr(df) on this PR (b204452) vs. main (4b5feb); each cell reads main -> this PR

| | 10 cols | 100 cols | 2K cols |
|---|---|---|---|
| 10 rows | 0.534s -> 0.592s (+10.8%) | 0.648s -> 0.647s (-0.14%) | 19.3s -> 7.9s (-59.0%) |
| 1K rows | 0.408s -> 0.358s (-12.2%) | 0.681s -> 0.641s (-5.92%) | 23.1s -> 8.82s (-61.8%) |
| 1M rows | 0.979s -> 0.698s (-28.7%) | 2.17s -> 1.75s (-18.8%) | 53.6s -> 36.7s (-31.5%) |


sfc-gh-snowflakedb-snyk-sa commented May 13, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found. (View Details)

license/snyk check is complete. No issues have been found. (View Details)

@sfc-gh-joshi sfc-gh-joshi changed the title SNOW-#????: Use pre-computed row counts more aggressively SNOW-#2105991: Use pre-computed row counts more aggressively May 15, 2025
@sfc-gh-joshi sfc-gh-joshi added the NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs label May 15, 2025
@sfc-gh-joshi sfc-gh-joshi marked this pull request as ready for review May 15, 2025 18:06
@sfc-gh-joshi sfc-gh-joshi requested a review from a team as a code owner May 15, 2025 18:06
@sfc-gh-joshi sfc-gh-joshi force-pushed the joshi/aggressive-row-counts branch 3 times, most recently from 07753f9 to 7197420 Compare May 19, 2025 18:40

@sfc-gh-lmukhopadhyay sfc-gh-lmukhopadhyay left a comment

This LGTM! This might affect the perf regression stats eventually if the query benchmark hasn't already been removed yet I believe?

@sfc-gh-joshi
Contributor Author

> This LGTM! This might affect the perf regression stats eventually if the query benchmark hasn't already been removed yet I believe?

Yes, though in most cases the value would go down.

@sfc-gh-joshi sfc-gh-joshi changed the title SNOW-#2105991: Use pre-computed row counts more aggressively SNOW-2105991: Use pre-computed row counts more aggressively May 20, 2025
@sfc-gh-joshi sfc-gh-joshi force-pushed the joshi/aggressive-row-counts branch from 7197420 to 0b47722 Compare May 20, 2025 17:19
Co-authored-by: Mahesh Vashishtha <mahesh.vashishtha@snowflake.com>
@sfc-gh-joshi sfc-gh-joshi merged commit f3871ab into main May 21, 2025
66 of 69 checks passed
@sfc-gh-joshi sfc-gh-joshi deleted the joshi/aggressive-row-counts branch May 21, 2025 20:26
@github-actions github-actions bot locked and limited conversation to collaborators May 21, 2025

Labels

NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs snowpark-pandas

4 participants